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Abstract 

We propose a general algorithm for approximating nonstandard Bayesian poste- 
rior distributions. The algorithm minimizes the Kullback-Leibler divergence of an 
approximating distribution to the intractable posterior distribution. Our method can 
be used to approximate any posterior distribution, provided that it is given in closed 
form up to the proportionality constant. The approximation can be any distribution 
in the exponential family or any mixture of such distributions, which means that it 
can be made arbitrarily precise. Several examples illustrate the speed and accuracy 
of our approximation method in practice. 



1 Introduction 

In Bayesian analysis the posterior distribution is often of non-standard form. To obtain 
quantities of interest under such a distribution, such as moments or marginal distribu- 
tions, we typically need to use Monte Carlo methods or approximate the posterior with 
a more convenient distribution. A popular method of obtaining such an approximation 
is structured or fixed-form Variational Bayes, which works by numerically minimizing the 
Kullback-Leibler divergence of an approximating distribution in the exponential family to 
the intractable target distribution. For certain problems, algorithms exist that can solve 
this optimization problem in much less time than it would take to approximate the pos- 



terior using Monte Carlo methods (see e.g. Honkela et al. , 2010). However, since these 



methods usually rely on analytic solutions to certain integrals and need conditional con- 
jugacy in the model specification, they are limited in the types of approximations and 
posteriors they can handle. 

We show that solving the optimization problem of fixed-form Variational Bayes is equiva- 
lent to performing a linear regression with the sufficient statistics of the approximation as 
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explanatory variables and the (unnormalized) log posterior density as the dependent vari- 
able. Inspired by this result, we present an efficient stochastic approximation algorithm 
for solving this optimization problem. In contrast to earlier work, our approach does not 
require any analytic calculation of integrals, which allows us extend the fixed-form Varia- 
tional Bayes approach to problems where it was previously not applicable. Our method can 
be used to approximate any posterior distribution, provided that it is given in closed form 
up to the proportionality constant. The type of approximating distribution can be any 
distribution in the exponential family or any mixture of such distributions, which means 
that our approximations can in principle be made arbitrarily precise. While our method 
somewhat resembles performing stochastic gradient descent on the variational objective 
function in parameter space, the linear regression view gives insights which allow a more 
statistically and therefore computationally efficient approach. 

Section [2] introduces fixed-form variational posterior approximation, the optimization prob- 
lem to be solved, and the notation used in the remainder of the paper. In Section [3] we 
re-interpret this optimization problem as a linear regression problem and we propose a 
stochastic approximation algorithm to solve it. We suggest several different ways of im- 
plementing this algorithm, details of which are given in Sections [4] and |5j In Section [6] we 
discuss how to assess the quality of the resulting posterior approximations. Here we also 
present a new method for approximating the marginal likelihood of a model. Our approach 
allows us to use a larger variety of posterior approximations than was previously possible. 
Section [7] gives some guidance on how to choose the type of posterior approximation. Here 
we discuss using mixtures of exponential family distributions and other nonstandard ap- 
proximations. Section [8] shows some examples of using our method in practice. Here we 
show that despite its generality, the efficiency of our algorithm is highly competitive with 
more specialized approaches. Finally, Section [9] concludes. 



2 Fixed-form Variational Bayes 

Let x be a vector of unknown parameters and/or latent random effects for which we have 
specified a prior distribution p(x), and let p(y\x) be the likelihood of observing a given set 
of data y. Upon observing y, we can use Bayes' rule to obtain our updated state of belief, 
the posterior distribution: 

/ | \ p(x,y) p(y\x)p(x) 
p{x\y) = — — — = -p — . (1) 

p(y) J p{y\x)p{x)dx 



An equivalent ( |Caticha and Giffin , 2006) definition of the posterior distribution is 



q[x) 

p{x\y) = argminE 9(:r) log — r = &igmm D[q{x)\p(x\y)} - logp(y), (2) 

q(x) P{X,y) q{x) 

where the optimization is over all proper probability distributions q(x), and where 
D[q(x)\p(x\y)} denotes the Kullback-Leibler divergence between q{x) and p(x\y). The KL- 
divergence is always non-negative and has a unique minimizing solution q(x) = p(x\y) 
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almost everywhere, at which point the divergence is zero. Note that the solution of ^ 
does not depend on the normalizing constant p(y) of the posterior distribution, but that 
we do obtain it as a by-product of solving D[q(x)\p(x\y)} = 0. 

The posterior distribution given in is the exact solution of the variational optimization 
problem in ([2]), but except for certain special cases it is not very useful by itself because 
it is of non-standard form. This means that we do not have analytical expressions for 
the posterior moments of x, or for the marginals p(xi\y) of the multivariate posterior 
distribution, nor can we determine the normalizing constant p(y). One method of solving 
this problem is to approximate these quantities using Monte Carlo simulation. A different 
approach, which we will pursue here, is to restrict the optimization problem in (J2| to a 
reduced set of more convenient distributions Q. If p(x, y) is of conjugate exponential form, 
choosing Q to be the set of factorized distributions q(x) = q{x\)q(x2) . . . q(xk) often leads 
to a tractable optimization problem that can be solved efficiently using an algorithm called 



Variational Bayes Expectation Maximization (VBEM, Beal and Ghahramani 2002). Such 
a factorized solution is attractive because it makes the variational optimization problem 
easy to solve, but it is also very restrictive: it requires a conjugate exponential model and 
prior specification and it assumes posterior independence between the different blocks of 
parameters Xi. This means that this factorized approach can be used with few models, and 



that the solution q(x) may be a poor approximation to the exact posterior (see e.g. Turner 



et al., 2008). 



A different approach to simplifying the variational optimization problem is to restrict the 
solution set Q to only include distributions of a certain parametric form q v (x), where r\ 
denotes the vector of parameters governing the shape of the posterior approximation. This 



approach is known as structured or fixed-form Variational Bayes (Honkela et al. 2010 



Storkey, 2000 Saul and Jordan, 1996). Usually, the posterior approximation is chosen to 



be a specific member of the exponential family of distributions: 

q v (x) = exp[T(x)r/ - U{rj))v{x), 

where T(x) is a 1 x k vector of sufficient statistics, U(rj) takes care of normalization, and 
is(x) is a base measure. The k x 1 vector r) is often called the set of natural parameters of the 
exponential family distribution q v (x). Using this approach, the variational optimization 
problem in ^ reduces to a parametric optimization problem in rj: 

f) = argminE^ (a: )[logg^(a;) - \ogp(x,y)]. (3) 

If our posterior approximation is of a standard form, the ^ q ( x ) logq(x) term in ^ can often 
be evaluated analytically. If we can then also determine log p(x,y), the optimization 
problem can be solved using gradient-based optimization or fixed-point algorithms. Pos- 
terior approximations of this type are often much more accurate than a factorized approx- 
imation, but the requirement that q v {%) is of standard form is still restrictive, as is the 
requirement of being able to evaluate \ogp(x, y). In addition, existing optimization 
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algorithms for fitting this type of approximation can be much slower than the EM type 
algorithms used for factorized approximation, reducing somewhat their advantage with 
respect to Monte Carlo methods. In the next section, we undertake the development of 
an algorithm that can efficiently solve the variational optimization problem for almost any 
type of approximating distribution q v (x) and exact posterior p(x\y). The only require- 
ments we impose on log p(x,y) is that it is given in closed form. The main requirement 
on q v (x) is that we can sample from it. For simplicity, Sections [3] and [4] will also assume 
that q v (x) is in the exponential family. Section [7] will then show how we can extend this to 
include mixtures of exponential family distributions. By using these mixtures and choosing 
q v (x) to be of a rich enough type, we can in principle make our approximation arbitrarily 
precise. 

3 Variational Bayes as linear regression 

For notational convenience we will use an unnormalized version of the approximating dis- 
tribution: 

qfi(x) = exp[f(x)fj]is(x), 

where we have removed the normalizer U(j]), and have added a constant to the vector of 
sufficient statistics, i.e. T(x) = (1, T(x)) and fj = (r] ,r]')'. 

The unnormalized version of the KL-divergence is given by 

/Q~ { X ) / 
Qf,{x) log -g- -r dv(x) - J q n {x)dv(x) (4) 

exp[T(x)fj][T(x)fj — log p(x,y)]dv(x) — / exp[T(x)fj]du(x) (5) 



At the minimum this gives (see Appendix |B|) t/q = K q logp(x,y) — logq(x), which is the 
usual bound on the log evidence. The other parameters, t] have the same minimum as in 
the normalized case. 

Taking the gradient of Q with respect to the natural parameters, fj we have 

V fj D[q fj (x)\p(x,y)] = J q n (x)[f(x)'f(x)fj - f(x)' logp(x, y)]dv{x). (6) 
Setting this expression to zero in order to find the minimum gives 



V 



qfj(x)T(x)'T(x)di , (x) 



(7) 



qfj(x)T(x)' logp(x, y)dv{x 

or in its normalized form 

f) = [E q f (x)'f (x)]-%f (x)' logp{x, y) (8) 



4 



Note that we have implicitly assumed that the Fisher information matrix, K q T(x)'T(x) is 
non-singular, which will be the case for any identifiable approximating exponential family 
q. Our key insight is to notice the similarity between Q with the maximum likelihood 
estimator for linear regression. Recall that in classical linear regression we have that the 
dependent variable {y n 6R:n = l,.., N} is distributed as N(Y\X/3, a 2 I) where X is the 
N x D design matrix, (3 is the D x 1 vector of regression coefficients and a 2 is the noise 
variance. The maximum likelihood estimator for /3 is then 

= {X'X)- l X'Y (9) 

Alternatively, the estimator in (|9j may be seen as the Bayesian posterior mean for f3, 
obtained with the Jeffreys prior on 0, a 2 . Note that this optimal estimator uses the re- 
alization of the X'X matrix, rather than its expectation KX'X, even if the latter were 



available. This observation will be relevant in Section pTi] where we consider stochastically 
approximating ([8]). To see the relation between ^ and ([9]), associate the design matrix 
X with the sufficient statistics T, the dependent variable Y with the unnormalized log 
posterior \ogp(x,y) and the regression coefficients /3 with the vector of natural parameters 
fj. Finally if we consider Monte Carlo estimates of the expectations in ^ then the analogy 



is very fitting (see (11) below). 



3.1 Choosing an estimator 

We will consider three different MC estimators for approximating ([8]). The first separately 
approximates the two integrals and then calculates the ratio: 



with S the number of Monte Carlo samples. The second approximates both integrals using 
the same samples from q 

a Yl T(x s )'f(x s ) J ^^2f(x s )'\ogp(x s ,y), x s ~ iid q(x) (11) 



V2 



Note that only this estimator is directly analogous to the linear regression estimator. The 
third estimator is available only when the first expectation is available analytically: 

f) a = E q f(x)'f(x) -^2f(x s )'logp(x s ,y), x s ^ iid q(x) (12) 

s 

We wish to understand the bias/variance tradeoff inherent in each of these estimators. To 
keep notation manageable consider the case with only k — 1 sufficient statistic^ and let 

a(x) = f{x)' f{x) = f(x) 2 (13) 
b(x) = T(x) log p(x,y) (14) 



1 These results extend in a straightforward if laborious manner to the case where k > 1 
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We can now write the three estimators of r) more concisely as 
i ^2 b(x r ) 

fji = i „ r - — -, x r ,x s ~ iid q {x) (15) 

7? a = ^ — , x s ~ iid q(x) (17) 

Ea 



Using a simple Taylor series argument it is straightforward (see Appendix [A| to approxi- 
mate the bias of these estimators: 

bias^) « ^™ (18) 

biasfo^ « Var ( Q ) E [ fe ] _ Cov ( a ' fe ) fig) 
Dias^ 2 J ~ SE[a]3 SE[a] 2 [i-J) 

Note that the first term is shared, but the first estimator does not have the covariance term 
as a result of the independent sampling in approximating the numerator and denominator. 
In contrast fj a is unbiased. Now consider the variances 

/ ^ \ 1 /(E6) 2 Var(a) Var(6)\ . . 

Var(r}i) « ^ 1 ' / ; + 7^ (20) 



S V (Ea) 4 (Ea) 2 / 

All three estimators have the same final term (the variance of the "analytic" estimator). 
Again the second estimator has an additional term resulting from the covariance between 
a and b which we find is typically beneficial in that it results in the variance of fj being sig- 
nificantly smaller. It is worth recalling that the mean squared error (MSE) of an estimator 
is given by 

E[(?y - fi) 2 } = Vax(fj) + bias(r)) 2 (23) 

Since both the variance and bias are 0(1/ S), the variance contribution to the MSE is 
0(1/ S) whereas the bias contribution is 0(1/ S 2 ), so the variance is actually a greater 
problem than the bias. From these expressions it is still not immediately obvious which 
estimator we should use. However, consider the case when the target distribution p is in 
the same exponential family as q, i.e. when \ogp(x,y) = T(x)X. It is then straightforward 
to show that 

, , A * AVar(T 2 ) Ar , A , A 2 Var(f 2 ) . , 

bias(m) « = — '-, Var(T)i) « 2 ^ — '- 24 

V ' SE[T 2 } 2 K J SE[T 2 ] 2 V ' 

bias(r/ 2 ) « 0, Var(r} 2 ) « (25) 



■2\ 



bias(r) a ) = 0, Var(r} a ) = ' (26) 
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We see that in this case for 572 the positive and negative contributions to both the bias and 
variance cancel. While this result will not hold exactly for cases of interest, it suggests that 
for exponential families which are capable of approximating p reasonably well, 172 should 
perform significantly better than fjx or even fj a . In this setting it is actually possible to 
see that 7)2 will in fact give the exact solution in k + 1 samples (with k the number of 
sufficient statistics), while the other estimators have non- vanishing variance for a finite 



number of samples. This means that the approximate equality in (25) can be replaced by 
exact equality. Using k + 1 samples Xi,i = 1, k + 1, assumed to be unique (which holds 
almost surely for continuous distributions q), we have 



'k+l 



k+1 



V2 



A 



(27) 



i=l 



That is, the algorithm has recovered p(x, y) exactly with probability one. If we assume we 
know how to normalize q, this means we also have p(x\y) exactly in this case. Note that 
we recover the exact answer here because the p(x, y) function evaluations are in themselves 
noise free, so the regression analogy really corresponds to a noise free regression. 



It is instructive to consider a toy example: fitting an exponential distribution p{x) = 
Xe~ Xx , about the simplest possible demonstration of the exact fitting phenomenon shown 
We assume that we are unaware that p happens to be normalized. 
[l,x]' and rate rj, i.e. q(x) 



in (27) 



approximation has T 
it is straightforward to calculate 



r/e 



-IjX 



Our variational 
Note that is an example where 



E q f(x)'f(x) 



1 

-77- 



-rf 



rf 



We test the three estimators in (10), (11) and (12) when the true exponential rate is 
A = 1.5, and sampling from the optimal q distribution with rj = 1.5. The results confirm 
that r} 2 finds the exact rate using just S = 2 MC samples, as predicted by (27). We 



would expect r) a to be unbiased, and this is borne out by the results shown in Figure [T] 
The estimator fji has both poor bias and such large variance that it often gives an invalid 
negative rate if fewer than 10 MC samples are used. While this is clearly a very simple 
example it hopefully emphasizes the potential benefit to be gained from using estimators 
related to 7)2. 
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eta 1 
eta 2 

eta analytic 1 



10 1 10 2 10 3 10 4 

# samples 

Figure 1: Comparison of three estimators for fitting a variational posterior q to a simple ex- 
ponential distribution p. 50 repeats were used to estimate the mean and variance of 
the estimator: the thick line shows the mean and the thin lines show ± one standard 
deviation. The x-axis indicator the number of MC samples, S, used. As expected in 
this case f/2 gives the correct solution of 1.5 using S > 2 samples. 



3.2 A stochastic approximation algorithm 

We have seen that the linear regression view of variational inference inspires a statistically 
efficient estimator for the natural parameters of the variational approximation in the form 



of (11). The question remains how we should turn this estimator into an automated 
algorithm. Note that in the general case where q and p are of different forms, the linear 
regression in equation (j8j) only gives the optimal fj parameters when we sample from the 
optimal approximation q^. Since we do not know this optimal approximation before having 
done the optimization, this creates a chicken-and-egg problem. The obvious solution would 
be to use an arbitrary initial distribution q, and use the optimality condition in ^ as a 
fixed-point equation to be iterated until converge, which would give updates of the form 

Vt+1 = \E qim) f(x)'f(x)]-% {vt) f(x)' \ogp(x, y). (28) 

However, such an approach is not guaranteed to converge, and in practice we find it often 
does not. Fortunately, the difference r]t+i—r]t generated by this approach can be interpreted 
as the negative approximate natural gradient of the Kullback-Leibler divergence Q (see 
Appendix D). This means that we can obtain a convergent algorithm by taking small 



enough steps in the direction indicated by (28). 



Based on this insight, we propose solving the variational optimization problem in (|8j) using 
the following stochastic optimization algorithm. The basic idea is to sample one MC 
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point at a time, updating the current estimate of q as our estimate of fj improves. Thus 
samples drawn later in the process will be closer to being samples from the optimal q. This 
approach allows us to efficiently approximate the statistics used in the regression of (|8]), 
while adapting i] slowly enough to ensure convergence of the algorithm. 



Algorithm 1 Stochastic Optimization for Fixed-Form Variational Bayes 

Require: An unnormalized posterior distribution p(x, y) 
Require: A type of approximating posterior q v (x) 
Require: The total number of iterations N 

Initialize rj to a first guess, for example by matching the prior p(x) 

Initialize C = E g T{x)'T{x), or a diagonal approximation of this matrix 

Initialize g = Cr\ 

Initialize C — 

Initialize g = 

Step-size w = l/\fN 

for t = 1 : N do 
Set 7] = C- X g 

Generate g t = unbiased estimate of K qri T(x)' \ogp(x, y) 
Generate C t = unbiased estimate of K qri T(x)'T(x) 
Set g = (1- w)g + wg t 
Set C = (1 - w)C + wC t 
if t > N/2 then 

Set g = g + g t 

Set C = C + C t 
end if 
end for 

return fj = C~ x g 



This algorithm is inspired by a long line of research on stochastic approximation, starting 
with the seminal work of Robbins and Monro (1951). However, contrary to most appli- 
cations in the literature, we use a fixed step size w = l/x/iV rather than a declining one 
in updating our statistics. The analyses of Robbins and Monro (1951) and Amari (1997) 
show that a sequence of learning rates w t = ct -1 is asymptotically efficient in stochastic 
gradient descent as the number of iterations N goes to infinity, but this conclusion rests 
on very strong assumptions on the functional form of the objective function (e.g. strong 
convexity) that are not satisfied for the problems we are interested in. Moreover, with 
a finite number of iterations N, the effectiveness of a sequence of learning rates that de- 
cays this fast is highly dependent on the proportionality constant c. If we choose c either 
too low or too high, it may take an extremely long time to reach the efficient asymptotic 
regime of this learning rate sequence. Nemirovski et al. (2009) show that a more robust 

1/y/N and that this is optimal for finite N 
In order to reduce the 



approach is to use a constant learning rate w 

without putting stringent requirements on the objective function 
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variance of the last iterate with this non- vanishing learning rate, they propose to use an 
average of the last L iterates as the final output of the optimization algorithm. The value 
of L should grow with the total number of iterations, and is usually chosen to be equal to 
N/2. Remarkably, they show that such an averaging procedure can match the asymptotic 
efficiency of the optimal learning sequence wt = ct~ l . For our particular optimization 
problem we have observed excellent results using constant learning rate w — 1/ y/N, and 
averaging starting half-way into the optimization. Note that we perform this averaging 
on the statistics g and C, rather than on the parameters rj = C~ l g, which is statistically 
more efficient for our application. Using this set-up, g and C are actually weighted MC 
estimates where the weight of the j-th MC sample during the t-th iteration (j < t) is given 
by w(l — iw)* -3 '. Since w G (0, 1), this means that the weight of earlier MC samples declines 
as the algorithm advances, which is desirable since we expect q to be closer to optimal later 
in the algorithm's progression. 

If the initial guess for rj is very far from the optimal value, or if the number of steps N is 
very small, it can sometimes occur that the algorithm proposes a new value for rj that does 
not define a proper distribution. We can guard against this by adapting the regression step 
7] = C~ 1 g to have additional shrinkage towards the current value of r\ by using 

with A a positive diagonal matrix of appropriate scale. The scale of A can be made to 
depend on the sampled g t , C t to prevent very large moves in rj: as long as the sampled 
statistics are still averaged in the usual way this does not bias the algorithm. 

Alternatively we could just reduce the step-size or increase the number of steps: One can 
show that the algorithm above becomes a pre-conditioned gradient descent algorithm as 
the number of steps goes to infinity (see Appendix D), which means that the algorithm 
is guaranteed not to diverge if the step size is small enough. In addition, note that the 



exact convergence result of equation (27) suggests that divergence is very unlikely if q n (x) 
and p(x, y) are close in functional form: choosing a good type of approximation will thus 
also help to ensure fast convergence. Finally, picking a good first guess for rj also helps 
the algorithm to converge more quickly. For very difficult cases it might therefore be 
worthwhile to base this guess on a first rough approximation of the posterior, for example 
by choosing r\ to match the curvature of logp(x, y) at its mode. For all our applications we 
found that a simple first guess for r\ and a large enough number of iterations was sufficient 
to guarantee a stable algorithm. 

Finally, it is worth mentioning the calculation of C~ l g. Note that by using conditional 



independencies in the exact posterior (Section 4.2) and the structure of the approximation 
(Section [5]), we can often avoid computing and storing the full C matrix, which can be quite 
high dimensional. In those rare cases where we do use the full C matrix, computing C^ 1 
explicitly (which is 0(K 3 )) is not recommended, but if desired then C _1 should be updated 
each iteration using rank-one updates (i.e. using the matrix inversion lemma) which cost 
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0(K 2 ). Similar low cost updates could be used to maintain Cholesky decompositions since 
C is symmetric, which is a numerically stable and efficient option. In very high dimensions 
one could use conjugate gradients to solve C~ 1 g approximately. 



4 Simulating the regression statistics 

In order to implement our algorithm we need to be able to construct unbiased estimates 
of E q T(x)'logp(x,y) and E q T(x)'T(x). It turns out that this is easy to do as long as 
q v {x) is easy to sample from. In fact, there are multiple ways of constructing stochastic 
estimates §t and C t with these expectations. We will derive several here. 



4.1 Basic stochastic approximation 



Following the results of Section 3.1 our default stochastic approximate of E 97) T(x)' \ogp(x, y) 
and E ? T(x)'T(x) is to simply draw a sample x* from q v (x) and to use this sample to cal- 
culate 

g t = f(x*y\ogp(x*,y) (29) 
C t = f(x*)'f(x*) (30) 



This works remarkably well because, as Section 3.1 explains, using the same random draw 



x* to form both estimates, part of the random variation in rj = C l g cancels out. 
4.2 Making use of conditional independencies 

For most statistical problems, the log posterior can be decomposed into a number of addi- 
tive factors, i.e. \ogp{x,y) = J2f=i ^°SPj( x : u)- The optimality condition in equation (8) 
can then also be written as a sum: 

JV 

77 = 5}E 9 f (x)'f (x^Egf (xyiog Pj (x,y) 
j=i 

This means that rather than performing one single linear regression we can equivalently 
perform N separate regressions. 

N 

v = ( 31 ) 

rf = lE q f(xyf(x)]- 1 E q f(x)'log P:j (x,y) (32) 

The practical benefit of this is that these separate regressions are often of much lower 
dimension: We know that element i of fjj will only be non-zero if the z-th sufficient statistic 
Ti(x) has non-zero partial correlation to logpj(x,y). Since the separate factors logpj(x,y) 
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often involve only a subset of the variables in x, this means that we can omit many of the 
sufficient statistics in performing each regression. That is, we have 

ff R = [E g T K (x) / T JJ (a;)]~ 1 E (? T R (a;)'logp j (a;,?/) 

with Tr(x) the relevant subset of T(x), and ff R the corresponding subset in fj J . The 
remaining elements in ff will be zero. By performing these lower dimensional regressions 
we can reduce the variance of the stochastic approximation algorithm, as well as reduce 
the overhead needed to store and invert K q T(x)'T(x). 

4.3 Using the gradient of the log posterior 

It is straightforward to split the optimality condition in ([8]) into a condition for the nor- 
malizer t)q and for the original vector of natural parameters 77. Following Appendix [Bj the 
optimality condition for the original parameters of the normalized g-distribution is given 
by 

fj = Cov q [T(x),T(x)}- 1 Cov q [T(x),\ogp(x,y)}. (33) 
Furthermore, using the properties of the exponential family of distributions, we know that 



CovJT(x), T{x)} = V v E qv T(x) (34) 

and 

Cov q [T(x), \ogp(x, y)} = logp(:r, y) (35) 

Both expectations can be approximated without bias using Monte Carlo. By differenti- 
ating these Monte Carlo approximations we can then obtain unbiased estimates of their 
derivatives. This is easy to do as long as the pseudo-random draw x* from q v (x) is a 
differentiable function of the parameters 77, given our random number seed z*. 

x * = f(.Vi z *)i with z* such that x* ~ q v (x) (36) 
g = V v logp{f( V ,z*),y) (37) 
C = V v T(f( V ,z*)) (38) 

By using the same random number seed z* in both Monte Carlo approximations we once 



again get the beneficial variance reduction effect described in Section |3.1| Empirically, we 
find that using gradients often leads to a more efficient stochastic optimization algorithm. 
For some applications the posterior distribution will not be differentiable in some of the 
elements of x, for example when x is discrete. In that case the stochastic approximations 



presented here may be combined with those of Section 3.1 



Furthermore, we can decompose logp(f(r], z*), y) = V r) /(?7, z*)V x logp(x, y) evaluated 
at x = f(r],z*), and equivalently V^T (/(??, z*)) = V v f (i], z*)V x T(x) . Note that for many 
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samplers V^/(r/, z*) is not defined, e.g. rejection samplers. In that case we can replace 
this quantity by the equivalent expression for a sample from an inverse-transform sampler: 



with the CDF and (j)r,{x*) the pdf of the sampler. 



4.4 Using the Hessian of the log posterior 



When we have both first and second order gradient information for \ogp{x, y) and if we 
choose our approximation to be multivariate Gaussian, i.e. q v (x) = N(m(r]),V(r))), we 
have a third option for approximating the statistics used in the regression. For Gaussian 



q(x) and twice different iable log p(x,y), Minka (2001b) and Opper and Archambeau (2009) 
show that 



V m E q logp(z, y) = E q V x logp(x, y) 



and 



V v E q logp(x,y) = ^E q V x V x log p(x,y) 



(39) 



(40) 



where Va;V x logp(x, y) denotes the Hessian matrix of logp(x, y) in x. 



For the multivariate Gaussian distribution we know that the natural parameters are given 
as rji = V~ l m and 772 = V~ x . Using this relationship, we can derive Monte Carlo estimators 
g and C using the identities ([34J35). We find that these stochastic approximations are 
often even more efficient than the ones in Section 4J3, provided that the Hessian matrix of 
log p(x,y) can be calculated cheaply. 



4.5 Using analytic expectations where possible 

In many cases it is possible to calculate the contributions of some of the factors log pi(x) 
to the stochastic approximations C and g analytically, while for others it is not. For 
example, this occurs when part of logp(x) (most often the prior) is conjugate to the 
posterior approximation q v (x). Even for some non-conjugate factors it might be possible 
to calculate certain expectations analytically. Using these exact expectations rather than 
their stochastic estimates can help reduce the variance of the approximations as well as 
reduce the time required to compute them, both of which increase the efficiency of the 
optimization procedure. 



4.6 Subsampling the data: double stochastic approximation 

The stochastic approximations derived above are all linear functions of logp(x) and its first 
and second derivatives. This means that these estimates are still unbiased even if we take 
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logp(x) to be a noisy unbiased estimate of the true log posterior, rather than the exact log 
posterior. For most statistical applications \ogp(x) itself is a separable additive function 
of a number of independent factors, i.e. logp(x) = J2f=i l°gPi( x )- These logpj(x) terms 
can be the likelihood contributions of individual observed data points, but they can also 
arise through conditional independencies between the x variables in the posterior. Using 
this fact we can construct an unbiased stochastic approximation of \ogp(x) as 



\ogp(x) 



N 
K 



K 



^logpj-i 



X 



(41) 



where the K factors \ogpj(x) are randomly selected from the total N factors. This approach 
was previously proposed for online learning of topic models by Hoffman et al. (2010). Since 



\ogp(x) has \ogp(x) as its expectation, performing stochastic approximation based on p(x) 
converges to the same solution as when using p(x), provided we resample the factors in 
\ogp(x) at every iteration. By subsampling the K N factors in the model the individual 
steps of the optimization procedure become more noisy, but since we can calculate p(x) 
faster than we can p(x), we can perform a larger number of steps in the same amount 
of time. If the number of factors in the posterior is especially large, this tradeoff often 
favors using subsampling. This principle has been used in many successful applications of 



stochastic gradient descent, see e.g. Bottou (2010). 



5 Linear transformations of the regression problem 

It is well known that classical linear least squares regression is invariant to invertible linear 
transformations of the explanatory variables. More precisely, when we have an iV x D 
matrix of explanatory variables X, and an N x 1 vector of dependent variables Y, we may 
equivalently perform the linear regression using a transformed set of explanatory variables 
X = XK', with K any invertible matrix of size D x D. The least squares estimator 
(3 = (X'X)~ l X'Y of the transformed problem can then be used to give the least squares 
estimator of the original problem as /3 = K fi: 

P = K\KX'XK'Y X KX'Y = {KX'Xy l KX'Y = (X'X^X'Y. 

Using the same principle, we can rewrite the optimality condition of equation (J8]) as 

fj = [E qv K{r,)f{x)'f{x)]- l E qn K{ri)f{x)' hgp(x, y) (42) 

for any invertible matrix K, which may depend on the variational parameters rj. Instead 
of solving our original least squares regression problem, we may thus equivalently solve a 
transformed version of it as defined by any invertible linear transformation K(rf). When 



we perform the linear regression of equation (42) for a fixed set of parameters r], the result 



will obviously be identical to that of the original regression with K(rj) = I, as long as we 
use the same random numbers for both regressions. However, when the 'data points' in 
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our regression are generated using different values of T], as is the case with the proposed 
stochastic approximation algorithm, the two regressions will not necessarily give the same 
solution. If the true posterior p(x\y) is of the same functional form as the approximation 
q^, the exact convergence result of Section [3] holds for any invertible K(r)), so it is not 
immediately obvious which K(rj) is best for general applications. 

We hypothesize that certain choices of K(rj) may lead to statistically more efficient stochas- 
tic approximation algorithms for certain specific problems, but we will not pursue this idea 
here. What we will discuss is the observation that the stochastic approximation algorithm 
may be easier to implement for some choices of K(rj) than for others, and that the com- 
putational costs are not identical for all K(rj). In particular, it is worth noting that the 
transformation K(rj) allows us to use different parameterizations of the variational approx- 
imation. Let be such a reparameterization of the approximation, let the new parameter 
vector (f>{rj) be an invertible and differentiate transformation of the original parameters i], 
and set K{rj) equal to the inverse Jacobian of this transformation, i.e. K(rj) = [V J? 0(?7)]~ 1 . 
Using the properties of the exponential family of distributions, we can then show that 

K(rj) Cov 9 jT(x), h(x)} = V^x) (43) 

for any differentiate function h(x). Using this result, the stochastic approximations of 



Section 4.3 for the transformed regression problem are found to be 



x * — with z* such that x* ~ q<j>{x) (44) 

g = V*logp(/(0,**),j/) (45) 
6 = V T(/(0,^)). (46) 



These new expressions for g and C may be easier to calculate than the original ones (36), 
and the resulting C may have a structure making it easier to invert in some cases. A 
particularly striking example of this occurs when we use a Gaussian approximation in 



combination with the stochastic approximations of Section 4.4 , using the gradient and 
Hessian of \ogp{x, y). In this case we may work in the usual natural parameterization, 
but doing so gives a dense matrix C with dimensions proportional to p 2 , where p is the 
dimension of x. For large p, such a stochastic approximation is expensive to store and 
invert. However, using the stochastic approximations above, we may also parameterize 
our approximation in terms of the mean m and variance V, and work with these param- 
eters directly. Doing so leads to the following sparse regression algorithm, as derived in 
Appendix O 
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Algorithm 2 Stochastic Approximation for Gaussian Variational Approximation 



Require: An unnormalized, twice different iable posterior distribution p(x,y) 
Require: The total number of iterations N 

Initialize the mean and variance of the approximation (m, V) to a first guess, for example 

by matching the prior p(x) 

Initialize z = m, P = V' 1 and a = 

Initialize z — 0, P = and a = 

Step-size w — 1/ \/N 

for t = 1 : N do 

Set V = P -1 and m = Va + z 
Generate a draw x* from N(m, V) 

Calculate the gradient g t and Hessian H t of logp(x, y) at s* 
Set a = (1 — w)a + wg t 
Set P = (1 - w)P - wH t 
Set z = (1 — w)z — «)x* 
if t > N/2 then 

Set a = a + g t 

Set P = P - H t 

Set z = z + x* 
end if 
end for 

Set V = P -1 and m = Va + z 
return m, V 



Instead of storing and inverting the full C matrix, this algorithm uses the sparsity induced 
by the transformation K{rj) to work with the precision matrix P instead. The dimensions of 
this square matrix are equal to p, rather than its square, providing great savings. Moreover, 
while the C matrix in the original parameterization is always dense, P will have the same 
sparsity pattern as the Hessian of log p(x,y), which may reduce the costs of storing and 
inverting it even further for many applications. An example of this algorithm can be found 
in Section I8TT1 



6 Marginal likelihood and approximation quality 



To fit our posterior approximation we minimize the Kullback-Leibler divergence between 
q v (x) and p(x\y), given by 



D(q v \p) = E gv log 



p(x\y) 



E, 



log 



p(x,y) 



+ log p(y), 



which shows that we need to know the marginal likelihood p(y) (the normalizing constant 
of the posterior) in order to evaluate this Kullback-Leibler divergence. As discussed before, 
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we do not need to know this constant in order to minimize D(q v \p) as p(y) does not depend 
on 7], but we do need know it if we want to determine the quality of the approximation, 
as measured by the final KL-divergence. In addition, the constant p(y) is also essential for 
performing Bayesian model comparison or model averaging. 

When our algorithm has converged, we have the following identity 

log p(x,y) = 170 + log q v (x) +r(x), 

where r(x) is the 'residual' or 'error term' in the linear regression of log p(x,y) on the 
sufficient statistics of q v {x). The intercept of the regression is 

770 = E qv [logp(x, y) - log q v (x)\ , 

the usual VB lower bound on the marginal likelihood. Exponentiating this term yields 

p(x,y) = exp(rja)q v (x)exv(r(x)), 

which we need to integrate with respect to x in order to find the marginal likelihood p(y). 
Doing so gives 

P(y) = ex P(^o)E^ exp(r(x)). (47) 
At convergence we have that K qn r(x) = 0. Jensen's inequality then tells us that 
E qv exp(r(x)) > 1, 

which shows that 170 is indeed a lower bound on the log marginal likelihood as we claimed 
earlier. If our approximation is perfect, the KL-divergence is zero and r(x) is zero almost 
everywhere. In that case the residual term vanishes and the lower bound will be tight, 
otherwise it will underestimate the true marginal likelihood. The lower bound fj is often 
used in model comparison, which works well if the KL-divergence between the approximate 
and true posterior distribution is of approximately the same size for all models that are 
being compared. However, if we compare two very different models this will often not be 
the case, and the model comparison will be biased as a result. In addition, as opposed to 
the exact marginal likelihood, the lower bound gives us no information on the quality of 
our posterior approximation. It would therefore be useful to obtain a better estimate of 
the marginal likelihood. One approach to doing this would be to evaluate the expectation 



in (47) using Monte Carlo sampling. Some analysis shows that this corresponds to approxi- 
mating p(y) using importance sampling, with q v (x) as the candidate distribution. It is well 
known that this estimator of the marginal likelihood may have infinite variance, unless r(x) 
is bounded from above. In general, we cannot guarantee the boundedness of r(x) for our 



approach, so we will instead approximate the expectation in (47) using something that is 
easier to calculate. At convergence, we know that the mean of r(x) is zero when sampling 
from q v (x). The variance of r(x) can easily be estimated using the mean squared error 
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of the regressions we perform during the optimization, with relatively low variance. We 
denote our estimate of this variance by s 2 . The assumption we will then make in order to 
approximate log p(y) is that r(x) is approximately distributed as a normal random variable 
with these two moments. This leads to the following simple estimate of the log marginal 
likelihood 

log p{y) « 170 + ^s 2 . 

That is, our estimate of the marginal likelihood is equal to its lower bound plus a correction 
term that captures the error in our posterior approximation q v {x). Similarly, we can 
approximate the KL-divergence of our posterior approximation as 

D(q v \p) « \s 2 . 

The KL-divergence is approximately equal to half the mean squared error in the regression 
of \ogp{x, y) on the sufficient statistics of the approximation. This relationship should not 
come as a surprise: this mean squared error is exactly what we minimize when we do a linear 
regression. Our experiments indicate that this approximation of the KL-divergence can be 



quite accurate (see Section 8.3), especially when the approximation q v (x) is reasonably 
good. 

Note that the scale of the KL-divergence is highly dependent on the amount of curvature 
in log p(x\y) and is therefore not easily comparable across different problems. If we scale 
the approximate KL-divergence to account for this curvature, this naturally leads to the 
R-squared measure of fit for regression modeling: 

R 2 = 1 - 



Var ? [logp(x, y)] 

The R-squared measure corrects for the amount of curvature in the posterior distribution 
and is therefore comparable across different models and data sets. In addition it is a well- 
known measure and easily interpretable. We therefore propose to use the R-squared as the 
measure of approximation quality for our variational posterior approximations. 



7 Using mixtures of exponential family distributions 

So far, we have assumed that the approximating distribution q v (x) is a member of the 
exponential family. Here we will relax that assumption. If we choose a non-standard 
approximation, this most likely means that certain moments or marginals of q n {x) are no 
longer available analytically, which should be taken into account when choosing the type of 
approximation. However, if we can at least sample directly from q n {x), it is often still much 
cheaper to approximate these moments using Monte Carlo than it would be to approximate 
the corresponding moments of p(x\y) using MCMC or other indirect sampling methods. 
We have identified two general strategies for constructing useful non-standard posterior 
approximations which are discussed in the following two sections. 
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7.1 Hierarchical approximations 

If we split our vector of unknown parameters x into p non-overlapping blocks, our approx- 
imating posterior may be decomposed as 

q(x) = q(x 1 )q(x 2 \x 1 )q(x 3 \x 2 , x x ) . . . q(x p \x p - 1 , . . . , Xi). 

If we then choose every conditional posterior Xj_2, • . . ,Xi) to be of a standard 

form, we can easily sample from the joint q(x), while still having much more freedom 
in capturing the dependence between the different blocks of x. In practice, such a con- 
ditionally standard approximation can be achieved by specifying the sufficient statistics 
of each standard block Xj_ 2 , . . . , x\) to be a function of the preceding elements 

Xi-i, Sj__ 2 , . . . ,x±. This leads to a natural type of approximation for hierarchical Bayesian 
models, where the hierarchical structure of the prior often suggests a good hierarchical 
structure for the posterior approximation. 

If every conditional Xj_2, . . . ,Xi) is in the exponential family, the joint may not 

be if the normalizing constant of q(xi\xi-\,Xi-2, ■ ■ ■ ,Xi) is a non-separable function of 
Xi-i, Xi-2, ■ ■ ■ , x± and the variational parameters. However, because the conditionals are 
still in the exponential family, our optimality condition still holds separately for the varia- 
tional parameters of each conditional with only slight modification. In that case we there- 
fore propose applying the optimization procedures separately to each of these conditionals. 
Without loss of generalization, consider the case where our posterior approximation con- 



sists of two factors: q{x) = q m (xi)q m (x2\xi) . In its normalized form (see Section 4.3), the 
optimality condition for the first factor is then given as 

771 = Var g [T(si)] _1 Cov q [T (xi), log p(x,y) - logg r)2 (x 2 |a;i)], 

where T(x\) denotes the sufficient statistics of q Vl {xi)- The optimality condition for the 
second block is 

r] 2 = [E q{xi) VaT q{x2lxi) (T(x 2 ))]~ 1 [E q(xi) Cov q{x2lxi) (T(x2)^ogp(x,y^ - log q m (xi))], 
where T(x 2 ) denotes the sufficient statistics of g^ 2 (x 2 |xi). By making use of the conditional 



independencies discussed in Section 4.2 we can often simplify these expressions further for 
given problems. 

Using this type of approximation, the marginals q(x{) will generally be mixtures of standard 
distributions, which is where the added flexibility of this method comes from. By allowing 
the marginals q{xj) to be mixtures with dependency on the preceding elements of x, we 
can achieve much better approximation quality than by forcing them to be of a standard 



form. A practical example of this in a hierarchical Bayesian model is given in Section 8.2 
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7.2 Using auxiliary variables 



Another method of constructing flexible posterior approximations is by using the condi- 



tionally standard approximation of Section 7.1, but by letting the first block of variables 



be a vector of auxiliary variables z, that are not part of the unknowns x. Doing this, the 
posterior approximation has the form 

q(x, z) = q(z)q(x\z). 

The factors q(z) and q(x\z) should both be of standard form, which allows the marginal 
approximation q(x) to be a general mixture of exponential family distributions, like a 
mixture of normals for example. If we use enough mixture components, the approximation 
q(x) could then in principle be made arbitrarily close to p(y\x). This mixture approximation 
can then be fitted by performing the standard KL-divergence minimization: 

f\ = argminE 3 Jlog^(x) - log p(x,y)} (48) 
v 



From (48) it becomes clear that an additional requirement of this type of approximation 
is that we can integrate out the auxiliary variables z from the joint q(x, z) in order to 
evaluate the marginal density q(x) at a given point x. Fortunately this is easy to do 
for many interesting approximations, such as discrete mixtures of normals or continuous 



mixtures like Student's t distributions. Also apparent from equation (48) is that we cannot 
use this approximation directly with the stochastic approximation algorithms proposed in 
the last sections since q(x) is itself not part of the exponential family of distributions. 



However, we can rewrite (48) as 

fj = argminE 3 Jlog^(z,z) - logp(x, y,z)], (49) 

with p(x, y, z) = p(x, y)q rj (z\x), and 
q v (x\z)q v (z) 



q v (z\x) 



J qr,(x\z)q v (z)dz' 



Equation (49) now once again has the usual form of a KL-divergence minimization with an 
approximation (q v (x,z)) in the exponential family. By including the auxiliary variables z 
in the 'true' posterior density, we can thus once again make use of our efficient stochastic 
optimization algorithms. Note that including z in the posterior did not change the marginal 
posterior p(x\y) which is what we are interested in. A practical example of this approach, 



using an approximation consisting of a mixture of normals, can be found in Section 8.3| 



8 Examples 

We demonstrate our proposed methodology on three problems from the literature. 
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8.1 Binary probit regression 

Binary probit (and logistic) regression is a classic model in statistics, also referred to as 
binary classification in the machine learning literature. Here we take a Bayesian approach 
to probit regression to demonstrate the performance of our methodology relative to existing 
variational approaches. We have N observed data pairs (y,i 6 {0,1}, Xj 6 IR P ), and we 
model y\x as P(y = l|x, w) = 0(w'x) where </>(.) is the standard Gaussian cdf and w e M p 
is a vector of regression coefficients, for which we assume an elementwise Gaussian prior 
N(0, 1). This is in fact a model for which existing approaches are straightforward so it is 
interesting to compare the performance. Of course the major benefit of our approach is 
that it can be applied in a much wider class of models. 

We use data simulated from the model, with N = 100 and P = 5, to be able to show 
the performance averaged over many datasets (1000 in fact). We compare Algorithm [2] 
to the VBEM algorithm of Ormerod and Wand (2010) which makes use of the fact that 



the expectations required for this model can in fact be calculated analytically. We choose 
not to do this for our method to investigate how effective our MC estimation strategy 



can be. For completeness we also compare to variational message passing (VMP, |Winn 
and Bishop, 2006), a message passing implementation of VBEM and expectation propaga- 



tion (EP, Minka 



2001a), which is known to have excellent performance on binary classifi- 



cation problems (Nickisch and Rasmussen, 2008), both implemented in Infer.NET (Minka 



et al. , 2010) a library for probabilistic inference in graphical models. 



Since this experiment is on synthetic data we are able to assess performance in terms of 
the method's ability to recover the known regression coefficients w, which we quantify as 
the root mean squared error (RMSE) between the variational mean and the true regression 
weights, and the "log score": the log density of the true weights under the approximate 
variational posterior. The log score is useful because it rewards a method for finding good 
estimates of the posterior variance as well as the mean, which should of course be central 
to any approximate Bayesian method. 

The results, shown in Figure [2] and [3j demonstrate that our method is able to outperform 
the standard analytic VBEM algorithm in terms of speed accuracy tradeoff. The improve- 
ment in the RMSE is noticeable, but the difference in log score is dramatic, showing that 
Algorithm [2] gives significantly better estimates of the variance that VBEM. In fact our 
results are very similar to those of EP, which obtained an RMSE of 0.261 and log score of 
0.079, but took an average of 18.2 milliseconds per run (note the system set ups are not 
completely comparable: EP was run on a laptop rather than a desktop, and Infer.NET 
is implemented in C# rather than Matlab). As expected VMP gave consistent results 
with VBEM: a RMSE of 0.268 and a log score of —4.85. The average R-squared obtained 
by our variational approximation was 0.97, indicating a close fit to the exact posterior 
distribution. 
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Figure 2: RMSE approximate posterior mean - Stochastic Linear Regressions v.s. VBEM 




time (milliseconds) 

Figure 3: Log-Score of approximate posterior - Stochastic Linear Regressions v.s. VBEM 



8.2 A stochastic volatility model 

Stochastic volatility models for signals with time varying variances are considered extremely 
important in finance. Here we apply our methodology to the model and prior specified 
in Girolami and Calderhead (2011). The data we will use, from Kim et al. (1998), is the 



percentage change y t in GB Pound v.s. US Dollar exchange rate, modeled as: 
y t = e t (3exp(v t /2). 
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The relative volatilities, Vt are governed by the autoregressive AR(1) process 

v t+l = 4>v t + Ct+i, with vi ~ N[0, a 2 /{I - 2 )]. 

The distributions of the error terms are given by e t ~ N(0, 1) and £t ~ N(0, a 2 ). The prior 
specification is as in Girolami and Calderhead"| ( |2011 ): 

p(P) oc /T 1 , (<p + l)/2 ~ Beta(20, 1.5), a 2 ~ Inv-Gamma(5, 0.25) 



Following Section 7. 1 we use the hierarchical structure of the prior to suggest a hierarchical 
structure for the approximate posterior: 



q v ((f),a 2 ,f3,v) = g^(0)g ?? (cT 2 |0)g, ) (/3,t;|0, 



a 



The prior of <ft is in the exponential family, so we choose the posterior approximation q v (<j)) 
to be of the same form: 

g„[(0+l)/2] = Beta^T/a). 

The prior for a 2 is inverse-Gamma, which is also in the exponential family. We again choose 
the same functional form for the posterior approximation, but with a slight modification 
in order to capture the posterior dependency between <p and a 2 : 

q.r,(or 2 \(p) ~ Inv-Gamma(?73, t/ 4 + r? 5 2 ), 

where the extra term rj^ 2 was chosen by examining the functional form of the exact full 
conditional p(a 2 \(p, v). 

The conditional prior p[log(/3), v \<p, a 2 ] can be seen as the diffuse limit of a multivariate 
normal distribution. We therefore also use a multivariate normal conditional approximate 
posterior: 

q v i(\og([3),v)\<j ) ,a 2 } = N(m,V), 

with 

V' 1 = P(0, a 2 ) + rje and m = V~V 
where P(<f>, cr 2 ) is the precision (inverse covariance) matrix of p[(\og(/3),v)\<p,a 2 }, r] e is a 



T x T matrix, and r\i is a T x 1 vector. Furthermore, an analysis following Opper and 



Archambeau (2009) shows that only a relatively small number of the elements of r/ 6 will 



be non-zero: all elements on the diagonal of t)q and all elements in the column and row 
belonging to log(/3). 

Using the GB Pound vs US Dollar exchange rate data, the approximation above has almost 
2000 free variational parameters to be optimized. This seems like a problematically large 
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number, but is easily feasible by using algorithm [2]to fit q v [\og((3),v\(fi, a 2 ] and the algorithm 
using only gradients (Section 4.3) to fit ^(0) and q v (a 2 \(f>). Expectations and normalizing 
constants for q ri [\og((3),v\(p, a 2 } can be calculated efficiently using the Kalman filter and 



smoother (see e.g. Durbin and Koopman 2001). For the current application we therefore 
only need to sample and a 2 each iteration, and not (3 and v. 



We compare the results against the "true" posterior, provided by a very long run of the 



MCMC algorithm of Girolami and Calderhead (2011). 
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Figure 4: Exact and approximate posterior for the stochastic volatility model - (3 parameter 
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Figure 5: Exact and approximate posterior for the stochastic volatility model - (j) parameter 
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Figure 6: Exact and approximate posterior for the stochastic volatility model - a 2 parameter 



As can be seen from Figures |4j [5] and [6j the posterior approximations are nearly exact. 
The posterior approximation for (3 seems especially good (Figure [4]), which is due to (3 
being in the last block of the hierarchical posterior approximation. Similarly, the posterior 
approximation for the latent volatilities v (not shown) are also indistinguishable from the 
exact posterior. 

Initializing q v (<f)) and q v {& 2 \<fi) to the prior, the above results can be obtained using 500 
iterations of our algorithm, with a single (<fr, a 2 ) sample per iteration. Using these settings, 
the single-threaded Matlab implementation of our stochastic optimization algorithm re- 
quires just under a second to complete on a 3Ghz processor. This is more than two orders 
of magnitude faster than the running time required by advanced MCMC algorithms for 
this problem. 

Our approach to doing inference in the stochastic volatility model shares some similarities 



with the approach of Liesenfeld and Richard (2008). They fit a Gaussian approximation to 



the posterior of the volatilities for given 0, a , f3 parameters, using the importance sampling 



algorithm of Richard and Zhang (2007), which is based on auxiliary regressions somewhat 



similar to those in Algorithm [TJ They then infer the model parameters using MCMC meth- 
ods. The advantage of our method is that we are able to leverage the information in the 
gradient and Hessian of the posterior, and that our stochastic approximation algorithm 
allows us to fit the posterior approximation very quickly for all volatilities simultaneously, 
while their approach requires optimizing the approximation one volatility at a time. Unique 
to our approach is also the ability to concurrently fit a posterior approximation for the 
model parameters <f), a 2 , (3 and have the approximate posterior of the volatilities depend on 



these parameters, while Liesenfeld and Richard (2008) need to re-construct their approxi 



mation every time a new set of model parameters is considered. As a result, our approach 
is significantly faster for this problem. 
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8.3 A beta-binomial model for overdispersion 



Albert (2009, Section 5.4) considers the problem of estimating the rates of death from 
stomach cancer for the largest cities in Missouri. This cancer mortality data is available 



from the R package LearnBayes, and consists of 20 pairs 



3i 



Vs. 



where n,- contains the 



number of individuals that were at risk in city j, and yj is the number of cancer deaths 
that occurred in that city. The counts jjj are overdispersed compared to what one could 
expect under a binomial model with constant probability, so Albert (2009) assumes the 
following beta-binomial model with mean m and precision K 



P(yj\m,K) 



tij\ B(Km + yj, K(l — m) + rij — yj) 



Vj 



B{Km,K{l - m)) 



where B(-,-) denotes the Beta-function. The parameters m and K are given the following 
improper prior 



p(m, K) oc 



1 



1 



m(l -m) (1 + K) 2 ' 

The resulting posterior distribution is non-standard and extremely skewed. In order to 



ameliorate this, Albert (2009) proposes to use the reparameterization 



6>i = logit(m), and 9 2 = log(i^). 

The form of the posterior distribution p(8\n,y) still does not resemble any standard dis- 
tribution, so we will approximate it using a finite mixture of L bivariate Gaussians. In 
order to do this, we first introduce an auxiliary variable z, to which we assign a categorical 
approximate posterior distribution with L possible outcomes. 

log q v (z) = 8(z = l)r/i + 5(z = 2)r/ 2 H h S(z = L)r} L , 

where S() is an indicator function. 

Conditional on z, we then assign 9 a Gaussian approximate posterior 



q v (6\z 



By adapting the true posterior as described in Section 7.2 we can fit this approximate 



posterior to p(9 \n, y) . We do this by using the basic algorithm of Section 4. 1 The regression 
statistics C and g used in the resulting algorithm depend linearly on the indicator vector 
5{z* = i), which denotes whether or not component i was used to sample 9* in each 
iteration. Rather than using this indicator function directly, we use its Rao-Blackwellized 
version E[5(z* = where 9* are the sampled parameters. The resulting stochastic 

estimates will have the same expectation as when using 5(z* = i) itself, but with lower 
variance at no additional computational cost. 
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We fit these approximations using a varying number of mixture components L and examine 
the resulting KL-divergence from the true posterior density. Since this is a low dimensional 
problem, we can obtain this divergence very precisely using quadrature methods. This also 
allows us to assess the accuracy of the KL-divergence approximation derived in Section [6] 
Finally, we present a contour plot that visually shows that a good approximation can 
indeed be obtained using a large enough number of mixture components. 
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Figure 7: KL-divergence between the variational approximation and the exact posterior density 
for an increasing number of mixture components. 
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Figure 8: Contour plots of posterior approximations using 1-8 mixture components, with the 
exact posterior at the bottom-right. 



Figures [7] and [8] show that we can indeed approximate this skewed and fat-tailed density 
very well using a large enough number of Gaussians. The R-squared of the mixture approx- 
imation with 8 components is 0.997. Also apparent is the inadequacy of an approximation 
consisting of a single Gaussian for this problem, with an R-squared of only 0.82. This 
clearly illustrates the advantages of our approach which allows us to use much richer types 
of approximations than was previously possible. Furthermore, Figure [7] shows that the 
KL-divergence of the approximation to the true posterior can be approximated quite accu- 
rately using the measure developed in Section |6j especially if the posterior approximation 
is reasonably good. 



9 Conclusion and future work 

We have introduced a stochastic optimization scheme for variational inference inspired by 
a novel interpretation of fixed-form variational Bayes as linear regression of the target log 
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density against the sufficient statistics of the approximating family. Our scheme allows very 
generic implementation for a wide class of models since in its most basic form only the 
unnormalized density of the target distribution is required, although we have shown how 
gradient or even Hessian information can be used if necessary. The generic nature of our 
methodology would lend itself naturally to a software package for Bayesian inference along 



the lines of Infer.NET (Minka et al. 2010) or WinBUGS (Gilks et al. 1994), and would 



allow efficient inference in a considerably wider range of models. Incorporating automatic 
differentiation in such a package could clearly be beneficial. Automatic selection of the 
approximating family would be very appealing from a user perspective, but is currently 
still an open research problem. 

Variational inference usually requires that we use conditionally conjugate models: since 
our method removes this restriction several possible avenues of research are opened. For 
example, collapsed versions of models (i.e. with certain parameters or latent variables inte- 



grated out) sometimes permit much more efficient MCMC inference (Porteous et al. , 2008) 
but adapting variational methods to work with collapsed models is complex and requires 



custom per model methodology (Teh et al. , 2007). However, our method is ambivalent to 



whether the model is collapsed or not, so it would be straightforward to experiment with 
different representations of the same model. 

We have shown it is straightforward to extend our methodology to use hierarchical struc- 
tured approximations and more flexible approximating families such as mixtures. This 
closes the gap considerably relative to MCMC methods. Perhaps the biggest selling point 
of MCMC methods is their asymptotic guarantees: in practice this means simply running 
the MCMC chain for longer can give greater accuracy, an option not available to a practi- 
tioner using variational methods. However, if we use a mixture approximating family then 
we can tune the computation time vs. accuracy trade off simply by varying the number of 
mixture components used. Another interesting direction of research along this line would 
be to use low rank approximating families such as factor analysis models. 

It should be noted that it is possible to mix our method with VBEM, for example using 
our method for any non-conjugate parts of the model and VBEM for variables that happen 
to be conjugate. This is closely related to the non-conjugate variational message passing 



(NCVMP) algorithm of Knowles and Minka (2011), implemented in Infer.NET, which 
aims to fit conjugate models while maintaining the convenient message passing formalism. 
Note that NCVMP only specifies how to perform the variational optimization, not how to 
approximate required integrals: in Infer.NET quadrature or secondary variational bounds 
are used where analytic expectations are not available, unlike the Monte Carlo approach 
proposed here. 
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A Bias and variance calculations 



We first consider the bias of the estimators. Consider the multivariate Taylor expansion of 
/ : R K ->• R around the point y eR K : 



f(y) ~ M + (y- y)'f{y) + ^r((y -y){y- y)V 2 f(y)) 
From this we can derive expressions for the expectation of f(y): 



(50) 



e/ ~ f(v) + 2 tr ( Cov (^)/"^)) 



(51) 



where we have chosen to perform the Taylor expansion around the mean y = Ey. For the 
first estimator let y — ^ J2 S a ( x s) and f(y) = 1/y, then we find 



Efj! = E 



1 Var(a) 



E[6] 



= E[ v ] + 



E[a] ' SE[af 
Var(a)E 



E[6] 



SEla} 3 



(52) 

(53) 
(54) 



since Var(y) = V&r(a)/S. We see that the bias term depends on the ratio Var(a)/E[a] 2 , 
i.e. the spread of the distribution of a relative to its magnitude. 
Now for the second estimator let 



V 



(55) 



so that T] 2 = f(y) = Note that Cov(y) = | Cov([a,6]') and 



v 2 /(y) = 



i 

'vl 







(56) 



Putting everything together we have 

Var(a)E6 Cov(a, b) 



Et] 2 « Er] + 



5E[a] 



SE[a] 



(57) 



Note that we recover the expression for Er)i if Cov(a, b) = 0, which makes sense because 
if we use different randomness for calculating E[a] and E[b] then a, 6 have covariance in 
our MC estimate. Finally the analytic estimator is unbiased: 



Ef) a = E V 



(58) 



32 



We now turn to the variances. The analytic estimator is a standard MC estimator with 
variance 



Var(i7 a ) 



Var(o) 



SE(a) 2 

Consider only the linear terms of the Taylor expansion: 

f(y)*m + (y-y)'f(y) 

Substituting this into the formula for variance gives 

Var[/(y)] = E[(/(y) - E/(y))(/(y) - E/(y))'] 
«E[/'(y)'(y-y)(y-y)r(^)] 
= / / (y)'Var(y)/'(y) 



(59) 



(60) 



(61) 
(62) 
(63) 



We will calculate the variance of the second estimator and derive the variance of the first 



estimator from this. Again let y be as in (55). Note that Var(y) = Gov(a,b)/S. We find 
1 



. (Eo) 2 Vara EbCov(a,b) Var 6 
Var V2 ~ S \ (Ea)* ~ 2 (Ea)3 + (E^ 



(64) 



The final term is equal to that for the analytic estimator. The second term is not present 
in the variance of the first estimator, since then a and b have no covariance under the 
sampling distribution, i.e. 



Var?}! 



1 /(E6) 2 Vara Var b 



S 



(Ea) 4 



(Ea) 



(65) 



The first term is always positive, suggesting that fjx is dominated by the analytic estimator. 
If we make the assumption that p(x) is in the same family as q then we have \ogp(x) = 
XT(x), then we find 



Ea = E[T 2 ] 
Eb w AE[T 2 ] 
Cov(a,6) w AE[T 4 ] - AE[T 2 ] 2 = r7Var(T 2 ) 
Var (a) = E[T 4 ] - E[T 2 ] 2 = Var(T 2 ) 



(66) 
(67) 
(68) 
(69) 



B Unnormalized to normalized optimality condition 

The unnormalized optimality condition in (J7|) is 



qrj(x)T(x)'T(x)du(x) 



-i 



qfj(x)T(x)' \ogp(x, y)dv(x) 



(70) 
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Clearly we can replace q by its normalized version q(x) = q/Z(rf) since the normalizing 
terms will cancel. Recalling T(x) = (l,T(x)) and fj = (t]o,T]')' we then have 



1 ET 
ET' E[T'T] 



-i 



EY 
E[TY] 



r) 



where Y := log p(x,y). Rearranging gives 



EY 
E[TY] 



1 ET 
ET' E[T'T] 



V 



Solving for r] easily gives 



r] = EY - E[T]r) 

rf = (E[T'T] - ET'ET)' 1 (E[TY] - ETEY) 
= Cov(T)- 1 Cov(T,y) 



(71) 



(72) 



(73) 
(74) 
(75) 



C Derivation of Gaussian variational approximation 

For notational simplicity we will derive our stochastic approximation algorithm for Gaus- 
sian variational approximation (algorithm [2]) under the assumption that x is univariate. 
The extension to multivariate x is conceptually straightforward but much more tedious in 
terms of notation. 

Let p(x, y) be the unnormalized posterior distribution of a univariate random variable x, 
and let q(x) = N(m, V) be its Gaussian approximation. In order to find the mean m 
and variance V that minimize the KL-divergence between q(x) and p(x\y) we solve the 
transformed regression problem defined by 



with = (m, V)' the usual mean- variance parameterization and where the natural param- 



eters are given by rj = (V 1 m, V 1 ) / . Following equation (43 ), we know that the regression 
statistics for this optimization problem are given by 

K(rj) Cov 9 jT(x),T(x)] = V^T(x) (76) 
with T(x) = (x, — 0.5x 2 )' the vector of sufficient statistics, and 

K( V ) Cov q ^[T(x),logp(x,y)\ = V E^ logp(x,y). (77) 



Recall identity (39) which states that 
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V^E^x) = E q ^V x g(x), 

with 0i = m the first element of the parameter vector 0, and g(x) any differentiable 
function. Similarly, identity (40) reads 



V^Eq g(x) 



1 



E q .V x V x g(x) 



V the second element of the parameter vector. Using these identities to stochas- 



with (f) 

tically approximate the regression statistic (1771) we obtain 



g x = V x logp(x,y)\ x=x *, and g 2 = --V x V x hgp(x,y)\ x=x * 



with x* a draw from q<f,(x). We can approximate the regression statistic (76) in a similar 
way. Making use of the fact that T(x) = (x, — 0.5x 2 )' is the vector of sufficient statistics, 
we obtain 





= 1 


Cl,2 


= -x* 




= 


C*2,2 


= -0.5 



(78) 

Note that Ci^ is the only stochastic entry in the matrix, which means we only have to 
store and average the samples x*, rather than the full C matrix. We will denote the 
average of these samples by z. Storing the 2-dimensional g statistic is equivalent to storing 
the averaged gradient and Hessian of logp(x,y), which we denote by a and P. Since our 
estimate C is upper-diagonal, the final regression step fj = C~~ x g from algorithm [l] can 
easily be expressed analytically in terms of these quantities, which gives V = P^ 1 and 
m = Va + z as the optimal mean and variance of the approximation. 



D Stochastic linear regression and gradient descent 

At each iteration of Algorithm [TJ the current variational parameters r] t are stochastically 
updated to a new value T)t+i- In Section [3] we derive this update from the viewpoint of 
linear regression. In this section we show that, for very small step-sizes, the update is also 
similar to a preconditioned stochastic gradient descent step. As in Algorithm [TJ we have 
rjt = C^ 1 g t , and we update this to 



35 



r) t+1 = [(1 - w t )C t + w t C t ] - w t )g t + w t g t ] = [C t + A t Cy-% + X t g t ], 

where c/t and Ct are the stochastic estimates generated during iteration t, and where At = 
w t /(l — w t ) is the effective step-size of the update. To characterize this update for small 
values of X t we perform a first order Taylor expansion of rjt+x around X t = 0, which gives 



Vt+i = Vt 



XtC^iCtVt-g^ + OiXl 



(79) 



Comparison with equation ([6j shows that the stochastic term in this expression [C t Vt — gt) 
is an unbiased estimate of the gradient of the KL-divergence D[q Vt (x)\p(x,y)]. Up to first 



order, the update equation in (79) thus represents a stochastic gradient descent step, pre- 



conditioned with the C t 1 matrix. Since this pre- conditioner is independent of the stochastic 
gradient approximation at iteration t, this gives a valid adaptive stochastic gradient descent 



algorithm, to which all the usual convergence results apply (see e.g. Amari, 1997). Note the 



connection between this result and the result of equation (18): there we show that the bias 



disappears as the number of samples grows large, while here we show that it disappears 
as the step-size gets small. In Section [3] we further show that the effect of this bias on the 
update r] t+ i disappears faster than the effect of the variance. 



If we take small steps, the pre-conditioner C t of equation (79) will be close to the Rie- 
mannian metric E gt Ct = Var gt [T(x)] used in natural gradient descent algorithms like that 



of Honkela et al. (2010). For certain exponential family distributions this metric can be 



calculated analytically, which suggests performing natural gradient descent optimization 
with updates of the form 



Vt+i = Vt ~ X t (r) t - V&r qt [T(x)] 1 Gov qt [T(x), log p(x,y)]) 



where the covariance Gov qt [T(x), log p(x,y)] is approximated using Monte Carlo. While 
this approach is clearly related to Algorithm [l] there are two significant differences: 1) this 
would correspond closely to the analytic estimator in (12), which we have seen can both 



theoretically and empirically be improved on by using the linear regression estimator (11 ). 
This means that the O(X^) term in equation (79) is generally beneficial. 2) Algorithm I] 



efficiently averages the MC samples over multiple iterations so that just a single MC sample 
per iteration is required. This is statistically more efficient than performing averaging in 
the space of the rj parameters, as is done in stochastic gradient descent. 
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